JLC 445/696 Final Project Template
04 December 2024
The Final Project is an extension of Project 3, that also involves the integration of work you’ve done throughout this semester and beyond. Use the format provided in this document as much or as little as you’d like, as long as the required components are covered.
If you want to get crazy with Markdown themes, here’s a great link.
1. Research Question
The goal of this project is to forecast future crime activity in Washington, D.C. for calendar year 2025.
These instructions will demonstrate how to forecast using an ARIMA model. An ARIMA model is an Autoregressive Integrated Moving Average, which is a univariate model predicting future values of a single variable over time. Let’s breakdown these terms:
- Autoregressive: “auto” means self, and “regressive” refers to using prior values.
- Integrated: “integrated” refers to the process of making data stationary, which identifies trends by differencing the data
- Moving Average: “moving average” is a calculation for a series of averages using subsets of data to identify trends over time. It is a more precise measure for smoothing out fluctuations.
Your research question should consider behavior, space, and time. You’ll either measure or control for each of these.
- Behavior: what crime type(s)
- Space: where
- Time: when is the data from, when are you forecasting
For example, using 2008-2024 data, forecast robberies counts and locations in Washington, D.C. for each month in 2025.
2. Data
Acquire
- Load the libraries you need for success:
library(tidyverse)
library(leaflet)
library(sf)
library(tigris)
library(zoo)
library(aTSA)
library(tseries)
library(forecast)- Get the latest and greatest D.C. crime data, like we did in Projects 2 and 3:
url <- "https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-"
years <- c(2008:2024)
full.urls <- paste0(url, years, ".csv")
dc.data <- data.frame()
for(file in full.urls) {
tmp.data <- read.csv(file, stringsAsFactors = FALSE)
dc.data <- rbind(tmp.data, dc.data)
}Wrangle
- Separate the date, format it, and enrich the dataset with more temporal measures:
dc.data <- separate(dc.data, REPORT_DAT, into = c("DATE", "TIME"), sep = " ")
dc.data$DATE <- as.Date(dc.data$DATE, format = "%Y/%m/%d")
dc.data$YEAR <- substr(dc.data$DATE, 0, 4)
dc.data$MONTH <- month(dc.data$DATE)
dc.data$MONTHS <- months(dc.data$DATE)
dc.data$DAY <- day(dc.data$DATE)
dc.data$DOW <- weekdays(dc.data$DATE)
dc.data$HOUR <- substr(dc.data$TIME, 0, 2)- Determine the total count and percentage of each crime type:
dc.sum <- dc.data %>%
group_by(OFFENSE) %>%
summarise(CRIME.COUNT = n()) %>%
mutate(PCT = round(CRIME.COUNT/sum(CRIME.COUNT)*100,2))- Use the table from #4 to find and filter the crime type(s) to forecast (choose your own!):
group <- c("ROBBERY")
dc.subset <- dc.data %>% filter(dc.data$OFFENSE %in% group)3. Methods
Building a relevant methodology is largely dependent on your research question and data. Any good methodology involves multiple steps for acquiring (done above), wrangling (done above), calculating, visualizing, and analyzing data. There is no single right answer here; rather, its the ability to craft a unique, distinct workflow that results in innovative work.
To use an ARIMA model, we will identify values for p, d, and q:
- p: Lagged observations
- d: Differencing the data set
- q: Prior error handling
Calculate
- Calculate crimes per day:
crime <- dc.subset %>%
group_by(DATE) %>%
summarise(CRIME.COUNT = n())- Fill in blank days:
crime <- crime %>%
complete(DATE = seq.Date(min(DATE), max(DATE), by = "day")) %>%
mutate(WEEKDAY = lubridate::wday(DATE, label = T, week_start = 1),
MONTH = lubridate::month(DATE, label = T, abbr = F),
WEEK = isoweek(DATE),
DAY = day(DATE),
YEAR = year(DATE))
# replace the NAs with 0s
crime <- replace(crime, is.na(crime), 0)- Change the data type to a proper time series, update the sequencing:
cleaned.crime <- zoo(crime$CRIME.COUNT,
seq(from = as.Date(min(crime$DATE)),
to = as.Date(max(crime$DATE)-1), by = 1))- Generate basic summary stats and a basic graph:
summary(cleaned.crime)## Index cleaned.crime
## Min. :2008-01-01 Min. : 0.000
## 1st Qu.:2012-03-25 1st Qu.: 5.000
## Median :2016-06-17 Median : 8.000
## Mean :2016-06-17 Mean : 8.301
## 3rd Qu.:2020-09-09 3rd Qu.:11.000
## Max. :2024-12-02 Max. :35.000
plot(cleaned.crime)
title("My Crime Per Day") # this adds a title to your graph- Make the data stationary (normalized, as a deviation from previous values), and a new basic graph. You can do more/higher differences as necessary (but probably don’t need to). Examine the graphs and compare the spikes. These are just examples:
stationary1 <- diff(cleaned.crime, differences = 1)
plot(stationary1)stationary2 <- diff(cleaned.crime, differences = 2)
plot(stationary2)stationary3 <- diff(cleaned.crime, differences = 3)
plot(stationary3)Look for the data with the narrowest y axis range, as consistently as close to 0 as possible. In this example, stationary1 is the best dataset.
- Conduct an Augmented Dickey-Fuller test to determine if the data is stationary. The resultant score should be a negative number, and the lower the number the stronger the data is stationary. In this example, ‘cleaned.calls’ is run as a comparison to data that is not differenced.
adf.test(as.matrix(cleaned.crime))## Warning in adf.test(as.matrix(cleaned.crime)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(cleaned.crime)
## Dickey-Fuller = -8.2554, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary1))## Warning in adf.test(as.matrix(stationary1)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(stationary1)
## Dickey-Fuller = -28.316, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary2))## Warning in adf.test(as.matrix(stationary2)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(stationary2)
## Dickey-Fuller = -37.643, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary3))## Warning in adf.test(as.matrix(stationary3)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: as.matrix(stationary3)
## Dickey-Fuller = -47.749, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
In this example, all scenarios are negative. When combined with the plot analysis in Step 5, stationary1 is best. Ideally you want to manipulate the data as little as possible. While ‘stationary2’ and ‘stationary3’ are even more negative, they are not negative enough to warrant choosing them.
- Lagged autocorrelations - specifically using the Autocorrelation (ACF) and Partial Autocorrelation (PACF) functions - will help determine the p and q values. For both calculations and visuals, you will use the best stationary dataset you chose in Steps 5 and 6 above.
- Determine the p order by examining the PACF graph and data. Specifically, look for the last significant lag:
pacf(stationary1)pacf(stationary1, pl=FALSE)##
## Partial autocorrelations of series 'stationary1', by lag
##
## 1 2 3 4 5 6 7 8 9 10 11
## -0.455 -0.318 -0.225 -0.163 -0.182 -0.180 -0.113 -0.068 -0.081 -0.060 -0.049
## 12 13 14 15 16 17 18 19 20 21 22
## -0.082 -0.099 -0.036 -0.018 -0.042 -0.044 -0.024 -0.039 -0.075 -0.038 -0.035
## 23 24 25 26 27 28 29 30 31 32 33
## -0.021 -0.005 -0.023 -0.040 -0.051 -0.021 0.001 -0.019 -0.015 0.001 -0.009
## 34 35 36 37
## -0.044 0.003 0.004 -0.016
- Determine the q order by identifying the last significant lag in the ACF:
acf(stationary1)acf(stationary1, pl=FALSE)##
## Autocorrelations of series 'stationary1', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 -0.455 -0.045 0.011 0.005 -0.040 0.003 0.042 0.002 -0.030 0.016
## 11 12 13 14 15 16 17 18 19 20 21
## -0.001 -0.030 0.002 0.050 -0.013 -0.027 0.010 0.013 -0.024 -0.015 0.044
## 22 23 24 25 26 27 28 29 30 31 32
## -0.012 -0.001 0.007 -0.017 -0.011 0.002 0.029 0.000 -0.024 0.012 0.007
## 33 34 35 36 37
## -0.020 -0.017 0.051 -0.017 -0.019
In this example, the p should be 6, and the q should be 1.
- Build an ARIMA function using the inputs derived from previous steps (based on analysis from the graph in Step 4, it’s reasonable to assume our data has season trends - so we’ll set it to TRUE):
arima.crime <- auto.arima(cleaned.crime, d = 1, max.p = 6, max.q = 1, seasonal = T)
summary(arima.crime)## Series: cleaned.crime
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## 0.0917 -0.9387
## s.e. 0.0138 0.0052
##
## sigma^2 = 11.56: log likelihood = -16331.87
## AIC=32669.75 AICc=32669.75 BIC=32689.94
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.008816429 3.399132 2.653829 -Inf Inf 0.7677494 -0.0004144812
Note that the ‘auto.arima’ function runs a series of ARIMA models and choose the best fit for the data. The best fitting model can be interpreted as ARIMA(d, p, q).
- Do a quick residuals checks on the model to see how it performed:
checkresiduals(arima.crime)##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 31.849, df = 8, p-value = 9.914e-05
##
## Model df: 2. Total lags used: 10
Proper residuals derived from your model include:
- the top graph not demonstrating any clear, cyclical, repeating patterns
- the bottom right graph generally being a normally distributed bell curve
- Use the ARIMA function to do a sample forecast for the next week of data:
# h = the number of units of time to measure
forecast.7days <- forecast(arima.crime, h=7)
round(sum(forecast.7days$upper[,2]),0)## [1] 91
forecast.7days$mean## Time Series:
## Start = 20060
## End = 20066
## Frequency = 1
## [1] 6.651305 6.161095 6.116167 6.112050 6.111672 6.111638 6.111634
- Identify the number of days between the last date in your dataset and the end of 2025:
forecast.window <- as.numeric(as.Date("2025-12-31")-max(crime$DATE))- Forecast the number of calls per day for everyday in that window, and plot it:
forecast.2025 <- forecast(arima.crime, h=forecast.window)
autoplot(forecast.2025)- Extract the forecasted values as a table, clean up the column names, add the forecast date, and month:
forecast.values <- as.data.frame(forecast.2025$mean)
forecast.values$ID <- seq.int(nrow(forecast.values))
forecast.upper <- as.data.frame(forecast.2025$upper)
forecast.upper$ID <- seq.int(nrow(forecast.upper))
forecast.values <- forecast.values %>%
left_join(forecast.upper, by = 'ID')
colnames(forecast.values) <- c("MEAN", "ID", "CI80", "CI90")
forecast.values$DATE <- as.Date(max(crime$DATE) + forecast.values$ID)
forecast.values$MONTH <- months(forecast.values$DATE)- Filter to 2025, summarize forecasts by month:
forecast.values.2025 <- subset(forecast.values, forecast.values$DATE > '2024-12-31')
forecast.months <- forecast.values.2025 %>%
group_by(MONTH) %>%
summarise(MEAN = round(sum(MEAN),0), FORECAST.90 = round(sum(CI90),0), FORECAST.80 = round(sum(CI80),0))
forecast.months$DIFF <- forecast.months$FORECAST.90 - forecast.months$FORECAST.80Visualize
- Graph of crime with a trend line:
graph.crime <- ggplot(crime, aes(x=DATE, y=CRIME.COUNT)) +
geom_line() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
xlab("Years") +
ylab("Crime Count") +
ggtitle("TITLE HERE") +
geom_area(fill="lightblue", color="black")
graph.crime +
geom_smooth(method = lm, col = "red", se = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))- Graph actual crime with predicted crime:
crime2024 <- crime[c(1,2)]
crime2025 <- forecast.values[c(5,1)]
names(crime2025) <- c("DATE", "CRIME.COUNT")
new.crime <- rbind(crime2024, crime2025)
graph.new.crime <- ggplot(new.crime, aes(x=DATE, y=CRIME.COUNT)) +
geom_line() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
xlab("Years") +
ylab("Crime Count") +
ggtitle("TITLE HERE") +
geom_area(fill="lightblue", color="black")
graph.new.crime +
geom_smooth(method = lm, col = "red", se = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))## `geom_smooth()` using formula = 'y ~ x'
- Tweak the data table for making a graph:
forecast.months$MONTH <- factor(forecast.months$MONTH, levels = forecast.months$MONTH)
forecast.long <- pivot_longer(forecast.months, cols = c(FORECAST.80, DIFF),
names_to = "Category", values_to = "Value")
forecast.long$Category <- gsub("DIFF", "90% Confidence Interval", forecast.long$Category)
forecast.long$Category <- gsub("FORECAST.80", "80% Confidence Interval", forecast.long$Category)- Graph your monthly forecasts, first for the upper bounds, and then for the mean:
ggplot(forecast.long, aes(x = MONTH, y = Value, fill = fct_rev(Category))) +
geom_bar(stat = "identity") +
geom_text(aes(label = Value), size = 3, colour = 'white', position = position_stack(vjust = 0.5)) +
labs(title = "Washington D.C. 2025 Monthly Forecast Upper Bounds",
x = "Month",
y = "Crime Count") +
scale_fill_manual(values = c("90% Confidence Interval" = "blue",
"80% Confidence Interval" = "grey"),
name = "Forecasts") +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
"September", "October", "November", "December")) +
coord_cartesian(ylim = c(0, 550)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))ggplot(forecast.months, aes(x = MONTH, y = MEAN)) +
geom_bar(stat = "identity") +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
"September", "October", "November", "December")) +
coord_cartesian(ylim = c(150, 200)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Washington D.C. 2025 Mean Monthly Forecast",
x = "Month",
y = "Crime Count")- Map your monthly hotspots.
- Get the city outline and roads
dc.roads <- roads("DC", "District of Columbia")
dc.roads.major <- dc.roads %>% filter(RTTYP %in% c("I","S","U"))
dc.outline <- county_subdivisions("DC", "District of Columbia")- Choose a hotspot method
- Facet by year or month
ggplot() +
geom_sf(data = dc.outline, color = "grey") +
geom_hex(aes(x = LONGITUDE, y = LATITUDE), data = dc.subset, bins = 8) +
scale_fill_continuous(type = "viridis") +
geom_sf(data = dc.roads, color = "black", alpha = 0.4) +
theme_classic() +
facet_wrap(~ MONTH)4. Findings
INSTRUCTIONS
- Describe the model
- Describe the outputs
- Contextualize the results
- Describe the hotspots
- Make it real
- Make it usable
Writing Tips
Related, some general notes about writing. Welcome to grad school, where the quality of your writing is held to a higher standard. Write directly. Thus, avoid using these words and phrases:
- Simply. Very. Indeed. Specifically. Perhaps. Arguably. Usually. Essentially. Could possibly. Could potentially. Necessarily. Only. Especially. In order to. Likely. Primarily. Generally. It is important that. It is essential to note.
Also, words have meanings. Some are stronger than others. Others aren’t real.
Further, here’s some other strategies to consider:
- Read sentences aloud… but don’t write like you speak.
- Ask yourself - does this sentence make sense, by itself.
- It’s ok to occasionally use quotes, don’t copy specific language or words.
- If you don’t know the meaning of a word, don’t use it!
And finally, this is just fun.
Creating the Output
- When submitting your projects, you are submitting either an HTML or PDF output of your Markdown file, as an upload to Canvas - not via email.
- When you are ready to see what the output looks like, click the ‘Knit’ button. The output file will be saved to the same folder this script is saved to.
- Never ever ever include any install.packages functions. Either comment them out with a #, or delete them.
Grading: 10pts
1. Research Question
1pt
Provide a brief paragraph (1-2 sentences) on the research question. State the question, in terms of behavior, space, and time. Define key terms and concepts, to include how variables will be measured.
2. Data
1pt
Provide a brief paragraph on the data used in this study. This includes the specific source(s), and the specific temporal and spatial constraints. Address any major advantages and disadvantages with using these data compared to other sources. Be specific enough that a reader could replicate this work.
3. Methods
1pt
Provide a brief paragraph on the research methods leveraged to answer this question. This includes the software, calculations, skills, techniques, and unique workflows used to analyze your data and develop an answer. You do NOT need to describe click-by-click instructions or lines of code describing how you did things; you DO need to describe a logical process that is specific enough that a reader could replicate.
4. Findings
3pts
Provide two paragraphs, 4-5 sentences each (1.5pts each). Paragraph 1 should describe your forecast: what do you expect to occur during 2025 - the highs, the lows, the norms, etc. Describe the count of events and likely locations (specific areas of the city).
Paragraph 2 should be an integration. Using your understanding of criminology, make sense of this forecast. Provide some context to the numbers and locations. Not necessarily why these crimes will occur, but what are the factors influencing your model. Provide at least one reference/citation to a peer-reviewed/academic research study that supports your response.
5. Visuals
3pts
Provide 1-2 graphs that highlight your model (1.5pts). These graphs can show your historical crime combined with your forecast, or just your forecast.
Provide 1-2 hotspot maps (1.5pts). These maps should be faceted by month or year.
All visuals should be thoughtful and logical (make sense to someone that has never seen them before); somehow, someway referenced in the text; and analytically meaningful (provide useful, relevant, and actionable insights)
6. Formatting
1pt
- Error-free writing. No typos, run-on sentences, or sentence fragments. Proper punctuation. Real words. Need help paraphrasing dense content? Try this. And here’s a great reference, too.
- File submitted as the knitted HTML output of an RMarkdown file, via Canvas.
ONE LAST THING… the source code for creating this file can be found here.
Please email me with any questions.